New Tools for Web-Scale N-grams

Authors

  • Dekang Lin
  • Kenneth Ward Church
  • Heng Ji
  • Satoshi Sekine
  • David Yarowsky
  • Shane Bergsma
  • Kailash Patil
  • Emily Pitler
  • Rachel Lathbury
  • Vikram Rao
  • Kapil Dalwani
  • Sushant Narsale
Abstract

We introduce a new set of tools for working with web-scale N-gram data. These tools lower the barrier for working with web-scale text, and create a new platform for acquiring large-scale linguistic knowledge. They will allow novel sources of information to be applied to long-standing natural language challenges.

Similar resources

Using Web-scale N-grams to Improve Base NP Parsing Performance

We use web-scale N-grams in a base NP parser that correctly analyzes 95.4% of the base NPs in natural text. Web-scale data improves performance. That is, there is no data like more data. Performance scales log-linearly with the number of parameters in the model (the number of unique N-grams). The web-scale N-grams are particularly helpful in harder cases, such as NPs that contain conjunctions.
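As a rough illustration of what "scales log-linearly" means here, the sketch below models a score as a linear function of the logarithm of the number of unique N-grams. The function names and the coefficients are hypothetical and are not taken from the paper; they only show that, under such a trend, every tenfold increase in data adds the same fixed gain.

```python
import math


def log_linear_score(num_unique_ngrams: int, intercept: float, slope: float) -> float:
    """Score under an assumed log-linear trend: intercept + slope * log10(N)."""
    if num_unique_ngrams <= 0:
        raise ValueError("num_unique_ngrams must be positive")
    return intercept + slope * math.log10(num_unique_ngrams)


# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
# They are NOT results reported in the paper.
INTERCEPT = 80.0
SLOPE = 1.5


def gain_per_tenfold(num_unique_ngrams: int) -> float:
    """Improvement obtained by multiplying the number of unique N-grams by ten."""
    before = log_linear_score(num_unique_ngrams, INTERCEPT, SLOPE)
    after = log_linear_score(10 * num_unique_ngrams, INTERCEPT, SLOPE)
    return after - before
```

Under this assumed form, `gain_per_tenfold` returns the slope regardless of the starting size, which is why gains look steady on a log axis while requiring ever larger amounts of data.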

Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for...

The StringNet Lexico-Grammatical Knowledgebase and its Applications

This demo introduces a suite of web-based English lexical knowledge resources, called StringNet and StringNet Navigator (http://nav.stringnet.org), designed to provide access to the immense territory of multiword expressions that falls between what the lexical entries encode in lexicons on the one hand and what productive grammar rules cover on the other. StringNet’s content consists of 1.6 bil...

SCALE: A Scalable Language Engineering Toolkit

In this paper we present SCALE, a new Python toolkit that contains two extensions to n-gram language models. The first extension is a novel technique to model compound words called Semantic Head Mapping (SHM). The second extension, Bag-of-Words Language Modeling (BagLM), bundles popular models such as Latent Semantic Analysis and Continuous Skip-grams. Both extensions scale to large data and al...

Gender and Animacy Knowledge Discovery from Web-Scale N-Grams for Unsupervised Person Mention Detection

In this paper we present a simple approach to discover gender and animacy knowledge for person mention detection. We learn noun-gender and noun-animacy pair counts from web-scale n-grams using specific lexical patterns, and then apply confidence estimation metrics to filter noise. The selected informative pairs are then used to detect person mentions from raw texts in an unsupervised learning f...
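The general idea of counting noun-gender pairs with a lexical pattern and then filtering by confidence can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the pattern (first token is the noun, last token is a gendered pronoun), the pronoun table, the thresholds, and all function names are hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical pronoun-to-class table; the abstract does not list its patterns.
PRONOUN_CLASS = {
    "his": "masculine", "himself": "masculine",
    "her": "feminine", "herself": "feminine",
    "its": "neuter", "itself": "neuter",
}


def collect_pair_counts(ngrams):
    """Accumulate noun -> class -> count from (tokens, count) pairs.

    Assumed pattern: the first token is the noun and the last token is a
    pronoun listed in PRONOUN_CLASS. N-grams that do not match are skipped.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, count in ngrams:
        if len(tokens) < 2:
            continue
        noun = tokens[0].lower()
        cls = PRONOUN_CLASS.get(tokens[-1].lower())
        if cls is not None:
            counts[noun][cls] += count
    return counts


def confident_labels(counts, min_total=10, min_ratio=0.8):
    """Keep nouns whose dominant class is both frequent and clearly dominant.

    The two thresholds are a simple stand-in for a confidence estimate.
    """
    labels = {}
    for noun, by_class in counts.items():
        total = sum(by_class.values())
        best_class, best_count = max(by_class.items(), key=lambda kv: kv[1])
        if total >= min_total and best_count / total >= min_ratio:
            labels[noun] = best_class
    return labels
```

A noun that co-occurs overwhelmingly with one pronoun class is kept, while a noun with mixed evidence is dropped as noise. Real N-gram data would need a more careful pattern, for example one that checks part-of-speech tags.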

Publication date: 2010